{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# 模型选择、欠拟合和过拟合",
   "id": "7a8d997bc0b93a"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "### 通过多项式拟合来",
   "id": "a7e00d1b99dd4199"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-25T07:52:23.444441Z",
     "start_time": "2024-08-25T07:52:21.797281Z"
    }
   },
   "cell_type": "code",
   "source": [
    "import math\n",
    "import numpy as np\n",
    "import torch\n",
    "from torch import nn\n",
    "from d2l import torch as d2l"
   ],
   "id": "initial_id",
   "execution_count": 1,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 生成数据集\n",
    "\n",
    "给定$x$，我们将[**使用以下三阶多项式来生成训练和测试数据的标签：**]\n",
    "\n",
    "(**$$y = 5 + 1.2x - 3.4\\frac{x^2}{2!} + 5.6 \\frac{x^3}{3!} + \\epsilon \\text{ where }\n",
    "\\epsilon \\sim \\mathcal{N}(0, 0.1^2).$$**)\n",
    "\n",
    "噪声项$\\epsilon$服从均值为0且标准差为0.1的正态分布。\n",
    "在优化的过程中，我们通常希望避免非常大的梯度值或损失值。\n",
    "这就是我们将特征从$x^i$调整为$\\frac{x^i}{i!}$的原因，\n",
    "这样可以避免很大的$i$带来的特别大的指数值。\n",
    "我们将为训练集和测试集各生成100个样本。"
   ],
   "id": "467a655188b33df"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-25T07:59:25.325653Z",
     "start_time": "2024-08-25T07:59:25.305621Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# 生成多项式\n",
    "max_degree = 20  # 多项式的最大阶数\n",
    "n_train, n_test = 100, 100  # 训练和测试数据集大小，注意这里的n_test实际上是n_valid\n",
    "true_w = np.zeros(max_degree)  # 分配大量的空间\n",
    "true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])\n",
    "\n",
    "# 噪音\n",
    "features = np.random.normal(size=(n_train + n_test, 1))\n",
    "np.random.shuffle(features)\n",
    "poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))\n",
    "for i in range(max_degree):\n",
    "    poly_features[:, i] /= math.gamma(i + 1)  # gamma(n)=(n-1)!\n",
    "# labels的维度:(n_train+n_test,)\n",
    "labels = np.dot(poly_features, true_w)\n",
    "labels += np.random.normal(scale=0.1, size=labels.shape)"
   ],
   "id": "acb241c380de8e45",
   "execution_count": 2,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "从生成的数据集中查看一下前2个样本， 第一个值是与偏置相对应的常量特征",
   "id": "9671abeaa65f5f01"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-25T08:00:15.672138Z",
     "start_time": "2024-08-25T08:00:15.615420Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# NumPy ndarray转换为tensor\n",
    "true_w, features, poly_features, labels = [\n",
    "    torch.tensor(x, dtype=torch.float32) for x in [true_w, features, poly_features, labels]\n",
    "]\n",
    "\n",
    "features[:2], poly_features[:2, :], labels[:2]"
   ],
   "id": "df05c5f1c541e556",
   "execution_count": 3,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 对模型进行训练和测试\n",
    "\n",
    "首先让我们实现一个函数来评估模型在给定数据集上的损失。"
   ],
   "id": "c1e17f730ad4a236"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-25T08:02:07.416953Z",
     "start_time": "2024-08-25T08:02:07.409590Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def evaluate_loss(net, data_iter, loss):\n",
    "    \"\"\"评估给定数据集上模型的损失\"\"\"\n",
    "    metric = d2l.Accumulator(2)  # 损失的总和,样本数量\n",
    "    for X, y in data_iter:\n",
    "        out = net(X)\n",
    "        y = y.reshape(out.shape)\n",
    "        l = loss(out, y)\n",
    "        metric.add(l.sum(), l.numel())\n",
    "    return metric[0] / metric[1]"
   ],
   "id": "c959107cd27f403a",
   "execution_count": 4,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "现在定义训练函数。",
   "id": "91e016d766c3ac29"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-25T08:02:23.293077Z",
     "start_time": "2024-08-25T08:02:23.278320Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def train(train_features, test_features, train_labels, test_labels,\n",
    "          num_epochs=400):\n",
    "    loss = nn.MSELoss(reduction='none')\n",
    "    input_shape = train_features.shape[-1]\n",
    "    # 不设置偏置，因为我们已经在多项式中实现了它\n",
    "    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))\n",
    "    batch_size = min(10, train_labels.shape[0])\n",
    "    train_iter = d2l.load_array((train_features, train_labels.reshape(-1, 1)),\n",
    "                                batch_size)\n",
    "    test_iter = d2l.load_array((test_features, test_labels.reshape(-1, 1)),\n",
    "                               batch_size, is_train=False)\n",
    "    trainer = torch.optim.SGD(net.parameters(), lr=0.01)\n",
    "    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',\n",
    "                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],\n",
    "                            legend=['train', 'test'])\n",
    "    for epoch in range(num_epochs):\n",
    "        d2l.train_epoch_ch3(net, train_iter, loss, trainer)\n",
    "        if epoch == 0 or (epoch + 1) % 20 == 0:\n",
    "            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),\n",
    "                                     evaluate_loss(net, test_iter, loss)))\n",
    "    print('weight:', net[0].weight.data.numpy())"
   ],
   "id": "55c2b9be64704bb9",
   "execution_count": 5,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 三阶多项式函数拟合(正常)\n",
    "\n",
    "我们将首先使用三阶多项式函数，它与数据生成函数的阶数相同。\n",
    "结果表明，该模型能有效降低训练损失和测试损失。\n",
    "学习到的模型参数也接近真实值$w = [5, 1.2, -3.4, 5.6]$。"
   ],
   "id": "f326b8df8ec7527c"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# 从多项式特征中选择前4个维度，即1,x,x^2/2!,x^3/3!\n",
    "train(poly_features[:n_train, :4], poly_features[n_train:, :4], labels[:n_train], labels[n_train:])"
   ],
   "id": "3dfb0c06c3d06674",
   "execution_count": 6,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 线性函数拟合(欠拟合)\n",
    "\n",
    "让我们再看看线性函数拟合，减少该模型的训练损失相对困难。\n",
    "在最后一个迭代周期完成后，训练损失仍然很高。\n",
    "当用来拟合非线性模式（如这里的三阶多项式函数）时，线性模型容易欠拟合。\n"
   ],
   "id": "2da370145bba5f1c"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# 从多项式特征中选择前2个维度，即1和x\n",
    "train(poly_features[:n_train, :2], poly_features[n_train:, :2], labels[:n_train], labels[n_train:])"
   ],
   "id": "ebd401766a10eedd",
   "execution_count": 8,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 高阶多项式函数拟合(过拟合)\n",
    "\n",
    "现在，让我们尝试使用一个阶数过高的多项式来训练模型。\n",
    "在这种情况下，没有足够的数据用于学到高阶系数应该具有接近于零的值。\n",
    "因此，这个过于复杂的模型会轻易受到训练数据中噪声的影响。\n",
    "虽然训练损失可以有效地降低，但测试损失仍然很高。\n",
    "结果表明，复杂模型对数据造成了过拟合。"
   ],
   "id": "822bca663f11e4cd"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# 从多项式特征中选取所有维度\n",
    "train(poly_features[:n_train, :], poly_features[n_train:, :], labels[:n_train], labels[n_train:], num_epochs=1500)"
   ],
   "id": "56fdc84d11b7195b",
   "execution_count": 9,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "execution_count": null,
   "source": "",
   "id": "e2e17e35b165c3b",
   "outputs": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
